Creditworthiness

Contents

  1. Problem Statement
  2. Libraries
  3. Data Collection
  4. Data Cleaning - Benchmark
  5. Data Exploration
  6. Data Preparation - Benchmark
  7. Model Building - Benchmark
  8. Further Preparation of Dataset - No Imputation
  9. Model Building
  10. Feature Engineering
  11. Model Optimisation
  12. Summary

Problem Statement

To build a model that predicts whether an applicant will default on or fully pay back a loan. The target variable 'outcome' indicates whether the loan was paid off or defaulted, making this a binary classification problem. The dataset contains 78 columns and 41,029 rows, each representing a funded loan.

Business Application: Companies, especially financial institutions, can increase their profit margins if a model can identify creditworthy consumers whilst lowering default risk. I will apply a range of machine learning models to this end, including Logistic Regression and tree-based models such as Random Forest.

Libraries

Data Collection

Parallel batch processing can speed up file loading when the file is large. Since this dataset is small, it has not been applied here.
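Were the file larger, the idea could be sketched as below: stream a single CSV in fixed-size batches, or read sharded files in parallel threads. The helper names and chunk size are illustrative, not from this notebook.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def load_csv_in_batches(path, chunksize=10_000):
    """Stream a CSV in fixed-size batches and concatenate the result."""
    # pandas reads each chunk lazily, keeping peak memory bounded.
    with pd.read_csv(path, chunksize=chunksize) as reader:
        return pd.concat(reader, ignore_index=True)

def load_many(paths):
    """Parallel variant: one thread per file, useful when data is sharded."""
    with ThreadPoolExecutor() as pool:
        return pd.concat(pool.map(pd.read_csv, paths), ignore_index=True)
```

Threads suffice here because `read_csv` releases the GIL during I/O; for CPU-bound parsing of very large shards, processes would be the next step.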

Data Cleaning - Benchmark

This part involves minimal cleaning/preparation so that the data can be passed through machine learning models for benchmarking purposes.
Further cleaning will be carried out after the benchmark results are obtained.

Descriptions of the columns are in the 'Data Dictionary' sheet within the Excel file.

Note re. the earliest_cr_line column: the data dictionary defines this variable as 'The month the borrower's earliest reported credit line was opened', yet the data contains both month and year in datetime format. It is left as-is for now for analysis purposes and will later be split into two columns: month and year.

Data Exploration

Overview

Charts and Graphs

Univariate

2.5% of the individuals have defaulted on loans.

The mean loan amount was USD 15,000, with a maximum of USD 40,000 and a minimum of USD 1,000.

From the above chart, approx. 25% of the columns have more than 50% null values, the column with the most nulls being 'months since last record'.
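The null-value check behind this chart can be reproduced with a short helper (the threshold and naming are my own):

```python
import pandas as pd

def null_report(df, threshold=0.5):
    """Fraction of missing values per column, plus columns above a threshold."""
    null_frac = df.isna().mean().sort_values(ascending=False)
    heavy = null_frac[null_frac > threshold].index.tolist()  # drop candidates
    return null_frac, heavy
```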

The bar chart above shows that credit line openings were fairly constant from month to month, with counts approx. 10% above the monthly average between August and October.

The graph above shows the years in which the most credit lines were opened. Most openings are centred within +/- 5 years of 2000, giving a left-skewed distribution.

From the pie chart above, the majority of applicants fall under mortgage, rent or own; Other, Any and None are negligible.

From the pie chart above, the most common loan purposes are debt consolidation, credit card and home improvement, together representing approx. 86%. The least common relate to education, renewable energy and weddings.

The state where most applicants live is California and the fewest live in Idaho. Approximately twice as many live in California as in Texas or New York.

The top profession is a managerial position. I am unsure what RN stands for, but I assume it is Registered Nurse; if so, that data needs to be aggregated. Other issues that could arise include spelling mistakes and inconsistent capitalisation. The data could also be explored further by grouping similar roles together: for example, 'Manager', 'Project Manager' and 'General Manager' could fall under a single managerial category.

The bar chart above shows that, among employed borrowers, those with 10+ years of experience took out the most loans.

None of the above charts follow a Gaussian distribution. Two columns, il_util and all_util, were checked further with Q-Q plots and normality tests such as the Kolmogorov-Smirnov test, since they appeared close to normal.
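A rough version of that check might look like the sketch below. Note it fits the normal to the sample itself, a simplification that makes the KS test optimistic (Lilliefors' correction would be stricter); the helper name is my own.

```python
import numpy as np
from scipy import stats

def ks_normality(x, alpha=0.05):
    """Kolmogorov-Smirnov test against a normal fitted to the sample."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values, common in these columns
    stat, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    return stat, p, p > alpha  # True: cannot reject normality at alpha
```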

Bivariate/ Multivariate

Analyzing the distribution of loan terms (e.g. 36 months vs. 60 months) for 'Fully Paid' and 'Defaulted' loans can provide insights into whether loan duration influences repayment behavior. From the bar chart above, the absolute number of defaults was similar regardless of term length. In percentage terms, however, the default rate for 36-month loans was significantly lower than for 60-month loans, with a ratio of roughly 2:3.9.

Whether home ownership influences repayment behaviour can be determined by comparing loan outcomes across statuses (rent, own, mortgage, other). From the chart above, renters make up a larger share of defaulted loans than of fully paid ones. The distributions for 'rent' and 'own' otherwise appear similar.

The correlation matrix above highlights variable pairs with absolute correlation above a 75% threshold. Four pairs stand out with correlations of more than 95%.
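The pairs behind the matrix can be listed programmatically; a minimal sketch:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.75):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep the upper triangle only, so each pair appears once
    # and self-correlations on the diagonal are skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```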

We can analyze how loan outcomes (fully paid or defaulted) are distributed across revolving line utilization rates and loan amounts. This can provide insights into whether borrowers with higher loan amounts or higher revolving line utilization rates are more likely to default on their loans. Lenders can use this information to assess the riskiness associated with lending to borrowers. From the above scatter plot, the revolving line utilization rate is densest between 60% and 80% for loan amounts of USD 10k-20k.

This visualization helps assess the association between borrowers' credit behavior, credit capacity, and loan repayment outcomes, providing valuable insights for risk assessment and lending strategies. The utilization rate has a mean of 54.7%, a lower quartile of 36.5% and an upper quartile of 74%.

The chart shows that the densely populated region lies between the lower and upper quartiles of the revolving utilization rate, indicating that borrowers with high credit utilization and lower credit capacity are more likely to default on their loans. In general, defaulters are spread across the full spectrum of utilization rates, with very few close to the USD 1M credit limit.

This scatter plot can provide insights into the relationship between a borrower's financial situation (represented by total current balance and total collection amounts) and their likelihood of loan default. The chart shows that the majority of defaulters had a total current balance of less than USD 0.5M and less than USD 5k in total amount ever owed. There is one outlying defaulter whose amount owed is over USD 50k.

Data Preparation - Benchmark

Model Building - Benchmark

Functions for Obtaining ML Results

Functions for confusion matrix and roc_auc chart

ML Results

Precision: tells us how accurate the model is when it predicts positive instances. XGBoost's precision of 97.4% suggests that improvement could still be made to increase profit margins.

Recall: tells us how well the model captures all positive instances in the dataset.

ROC-AUC (Receiver Operating Characteristic - Area Under the Curve):

  1. ROC-AUC score measures the area under the curve, which is a graphical representation of the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) across various thresholds.
  2. A ROC-AUC score of 0.5852 for XGBoost indicates that the model's ability to distinguish between the positive and negative classes is slightly better than random chance and that there is room for improvement.

Since the data is imbalanced, focus will be on precision, recall (sensitivity) and F1-score.

Precision, recall and F1 are based on the definitions below:

True Positive (TP): The model predicts that the loan will be repaid and the original outcome in the dataset is the same.

True Negative (TN): The model predicts that the loan will default and the original outcome in the dataset is the same.

False Positive (FP): The model predicts that the loan will be repaid, but the original outcome in the dataset is that it will default.

False Negative (FN): The model predicts that the loan will default, but the original outcome in the dataset is that it was fully repaid.
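Under those definitions the three scores reduce to a few lines (illustrative helper, not from the notebook):

```python
def classification_scores(tp, tn, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts.

    Positive class here = 'loan repaid', matching the definitions above.
    """
    precision = tp / (tp + fp)   # how often a predicted repayment is correct
    recall = tp / (tp + fn)      # how many actual repayments are caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```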

Further Preparation of Dataset - No Imputation

In this section, we will get a completely clean dataset with no null values. After this, we can do feature engineering by adding columns we removed and imputing them or by adding other features.

Model Building

Feature Engineering

Now that we have our complete dataframe (new_df) with no null values and no imputation, we can apply feature engineering techniques. This can include various imputation methods, log transformations, feature crossing, binning/grouping and handling of outliers. Removed columns can be re-added too.
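Two of those techniques could be sketched as below. The column names ('annual_inc', 'loan_amnt') and the bin edges are my assumptions, chosen to match the USD 1k-40k loan range noted earlier.

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Illustrative log transform and binning; column names are assumed."""
    out = df.copy()
    if "annual_inc" in out.columns:
        # log1p tames right skew and is safe at zero income
        out["log_annual_inc"] = np.log1p(out["annual_inc"])
    if "loan_amnt" in out.columns:
        # coarse bands over the USD 1k-40k loan range
        out["loan_amnt_band"] = pd.cut(
            out["loan_amnt"],
            bins=[0, 5_000, 15_000, 25_000, 40_000],
            labels=["small", "medium", "large", "xlarge"],
        )
    return out
```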

Feature 1 - EMI

From the above roc_auc curve, XGBoost and Logistic Regression are very similar.

We will now drop this new variable because it is highly correlated with the existing features from which it is derived. Logistic regression assumes that predictors are not highly correlated.
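For reference, EMI follows the standard instalment formula, derived directly from principal, interest rate and term, which is exactly why it correlates so strongly with those columns:

```python
def emi(principal, annual_rate_pct, term_months):
    """Equated Monthly Instalment: P*r*(1+r)^n / ((1+r)^n - 1)."""
    r = annual_rate_pct / 12 / 100  # monthly interest rate
    if r == 0:
        return principal / term_months  # zero-interest edge case
    growth = (1 + r) ** term_months
    return principal * r * growth / (growth - 1)
```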

Feature 2 - Emp_length

From the ROC-AUC curve above, XGBoost sits above the rest and is therefore the better model.
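Converting emp_length to a numeric feature could be sketched as below (the string formats match the '10+ years' values seen in the univariate charts; the helper name is my own):

```python
import re

def emp_length_to_years(value):
    """Map strings like '10+ years', '< 1 year', '3 years' to numbers."""
    if value is None or (isinstance(value, float) and value != value):
        return None  # missing value (None or NaN)
    s = str(value).strip()
    if s.startswith("<"):
        return 0.0  # '< 1 year' treated as zero full years
    m = re.search(r"\d+", s)
    return float(m.group()) if m else None
```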

Model Optimisation

Fine-Tuning
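A randomised-search sketch of the fine-tuning step, shown here with scikit-learn's GradientBoostingClassifier as a stand-in (XGBClassifier/LGBMClassifier accept analogous parameters); the toy data and parameter grid are illustrative, not the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Imbalanced toy data standing in for the loan dataset.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

param_dist = {
    "n_estimators": [100, 200, 400],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_dist,
    n_iter=5,
    scoring="roc_auc",              # matches the metric reported above
    cv=StratifiedKFold(n_splits=3),  # stratified folds preserve class balance
    random_state=0,
)
search.fit(X, y)
```

Randomised search trades exhaustiveness for speed; with only a handful of candidates per run it suits iterative tuning better than a full grid search.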

Summary

From the results so far, LightGBM gives the best results: a precision score of 97.53% with a ROC-AUC score of 0.6315, against XGBoost's precision of 97.53% with a ROC-AUC score of 0.6137. To improve the results further, external factors such as marital status, gender and education could be considered. Techniques such as SMOTE-NC could also be used to balance the data, although training would take longer. With fine-tuning, XGBoost's score improved to 0.6223; further fine-tuning of LightGBM could be done too.

Businesses that want to maximize profitability and minimize the risk of loan default will benefit from the LightGBM model over the others. A conservative approach prioritizes minimizing default risk through stricter lending standards (higher credit score requirements, lower debt-to-income ratios), resulting in fewer approved loans and hence fewer defaults. In contrast, an aggressive approach favors higher recall (fewer false negatives), meaning institutions can make loans accessible to borrowers identified as good even at a higher risk of default.

Institutions should focus on precision if they follow a conservative approach, and on recall for an aggressive one. Since 2.47% (100% - 97.53%) of the loans the model approves would still default, the LightGBM model would lower the default rate from the current 2.5% to 2.47%, improving profit margins by roughly 0.03 percentage points. However, the model may not be reliable given its low ROC-AUC scores.